fix: auto-purge corrupted buildkit cache on integrity check failure #101

Open
devin-ai-integration[bot] wants to merge 1 commit into main from devin/1779992772-auto-purge-corrupted-cache

devin-ai-integration[bot] commented May 28, 2026

Summary

When the BoltDB integrity check fails at startup, this PR automatically wipes /var/lib/buildkit so buildkitd starts fresh, breaking the corruption persistence loop.

The problem: Previously, when checkBoltDbIntegrity failed at startup (line 490), the code only logged core.error("BoltDB integrity check failed") and went on to use the corrupted disk. At commit time, a failed integrity check correctly skips the commit, but the previously committed corrupted snapshot remains the latest one. The next job clones that same corrupted snapshot, so the failure repeats on every build until someone deletes the disk manually.

The fix: On integrity check failure at startup:

  1. Wipe all contents of /var/lib/buildkit (using find -mindepth 1 -delete to preserve the mount point)
  2. Report the auto-purge event to the backend via reportBuildPushActionFailure
  3. Proceed to start buildkitd with a clean state
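
For reviewers who want to reason about step 1 without a shell, here is a minimal sketch of the same "delete the contents, keep the directory" behavior using Node's fs module. The helper name purgeDirContents is hypothetical and is not taken from this PR, which shells out to find -mindepth 1 -delete instead.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper (not from the PR): remove every entry inside `dir`
// while leaving `dir` itself in place, similar in effect to
// `find <dir> -mindepth 1 -delete`. Returns the number of top-level
// entries that were removed.
function purgeDirContents(dir: string): number {
  let removed = 0;
  for (const entry of fs.readdirSync(dir)) {
    // recursive: also removes nested directories and their files.
    // force: do not throw if the entry disappears in the meantime.
    fs.rmSync(path.join(dir, entry), { recursive: true, force: true });
    removed++;
  }
  return removed;
}
```

Deleting the children rather than the directory matters here because the directory is a mount point: removing and recreating it would not be equivalent.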

The build will have a cache miss that one time, but the fresh databases will pass the post-action integrity check and be committed, restoring the cache for subsequent builds.
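
The loop and its fix can be illustrated with a small simulation. This is a toy model under stated assumptions, not the action's code: a snapshot is reduced to a single corrupted flag, each job clones the latest committed snapshot, and a commit happens only when the post-action check passes. The names Snapshot and simulateJobs are hypothetical.

```typescript
// Toy model: a snapshot is either corrupted or not.
type Snapshot = { corrupted: boolean };

// Simulate `jobs` consecutive builds starting from `latest`.
// Returns, per job, whether that job started from a corrupted clone.
function simulateJobs(latest: Snapshot, jobs: number, autoPurge: boolean): boolean[] {
  const startedCorrupted: boolean[] = [];
  for (let i = 0; i < jobs; i++) {
    // Each job clones the latest committed snapshot.
    const disk: Snapshot = { ...latest };
    startedCorrupted.push(disk.corrupted);

    // Startup integrity check: with the fix, a failure wipes the disk.
    if (disk.corrupted && autoPurge) {
      disk.corrupted = false;
    }

    // Post-action integrity check: commit only if the disk passes.
    if (!disk.corrupted) {
      latest = disk;
    }
  }
  return startedCorrupted;
}

// Without auto-purge, every job starts from the corrupted snapshot.
const withoutFix = simulateJobs({ corrupted: true }, 3, false); // [true, true, true]
// With auto-purge, only the first job does.
const withFix = simulateJobs({ corrupted: true }, 3, true); // [true, false, false]
```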

Customer context: tandemtechnology reported that releases had been taking much longer since about 4 PM ET on May 27. The root cause was a corrupted sticky disk, which they had to delete manually after we identified the issue.

Review & Testing Checklist for Human

  • Verify the find -mindepth 1 -delete command removes all buildkit contents while leaving /var/lib/buildkit itself in place (it is a sticky disk mount point and must remain)
  • Confirm that after auto-purge, buildkitd can start successfully from an empty /var/lib/buildkit directory (it should create fresh database files)
  • Test that the post-action integrity check passes on fresh databases and commits the clean state
  • Consider whether a Grafana alert on reportBuildPushActionFailure with event boltdb integrity auto-purge would be valuable for proactive monitoring
  • Consider the false-positive edge case: if the integrity check fails for reasons other than corruption (e.g., timeout or OOM), a purge would discard a good cache. The existing timeout/OOM handling in checkBoltDbIntegrity treats those cases as "skip" (returns true), so this should be safe.
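
The last checklist item can be made concrete with a sketch of the decision it describes: only a genuine corruption result should trigger a purge, while timeouts and OOM kills count as "skip". How checkBoltDbIntegrity actually detects these cases is not shown in this PR, so the exit-code mapping below is an assumption (124 is what GNU timeout returns when the limit is hit, and 137 is 128 + SIGKILL). The names CheckOutcome, classifyExit, and shouldPurge are hypothetical.

```typescript
type CheckOutcome = "passed" | "corrupt" | "timed_out" | "oom_killed";

// Hypothetical mapping from the exit code of an integrity-check subprocess.
// Assumption: any other non-zero exit code means the check found corruption.
function classifyExit(code: number): CheckOutcome {
  if (code === 0) return "passed";
  if (code === 124) return "timed_out"; // GNU `timeout` exit code on expiry
  if (code === 137) return "oom_killed"; // 128 + 9 (SIGKILL), typical of an OOM kill
  return "corrupt";
}

// Purge only on a definite corruption result; inconclusive checks
// (timeout, OOM) leave the cache untouched.
function shouldPurge(outcome: CheckOutcome): boolean {
  return outcome === "corrupt";
}
```

Keeping the classification separate from the purge decision makes the "inconclusive means keep the cache" rule easy to unit test.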

Recommended test plan: Deliberately corrupt a boltdb file in a test sticky disk, then run a build using setup-docker-builder and verify:

  1. The integrity check detects the corruption
  2. The cache is auto-purged
  3. buildkitd starts fresh
  4. The build completes (with cache miss)
  5. The fresh state is committed
  6. The next build uses the new clean cache
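
For the "deliberately corrupt a boltdb file" step, one possible approach is to overwrite a span of bytes in a copy of the database file. The helper below is a hypothetical sketch; which file to target, and which offset and length reliably make the integrity check fail, are assumptions to confirm against a real test disk.

```typescript
import * as fs from "node:fs";

// Hypothetical test helper: overwrite `length` bytes of `file`, starting at
// byte `offset`, with 0xff. Intended for throwaway test disks only.
function corruptFile(file: string, offset: number, length: number): void {
  const fd = fs.openSync(file, "r+"); // open for reading and writing, no truncation
  try {
    const garbage = Buffer.alloc(length, 0xff);
    // writeSync(fd, buffer, bufferOffset, byteCount, filePosition)
    fs.writeSync(fd, garbage, 0, length, offset);
  } finally {
    fs.closeSync(fd);
  }
}
```

Run this only while buildkitd is stopped, so the write does not race with the daemon.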

Notes

  • Files > 400MB are already skipped by the integrity check (they can't be mmapped within the 512MB memory limit). For very large corrupted databases, the symptom would be a buildkitd startup failure. A separate mechanism (e.g., buildkitd startup timeout + retry with fresh state) could address that case in the future.
  • The reportBuildPushActionFailure call sends the event to the Blacksmith backend API at /stickydisks/report-failed, providing visibility into auto-purge events.
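
The size limit in the first note can be sketched as a simple partition of candidate database files. This is illustrative only: the names DbFile and partitionBySize are hypothetical, and treating 400MB as 400 * 1024 * 1024 bytes is an assumption, since the PR does not say whether it means MB or MiB.

```typescript
interface DbFile {
  path: string;
  sizeBytes: number;
}

// Assumption: "400MB" is interpreted as 400 MiB.
const MAX_CHECKABLE_BYTES = 400 * 1024 * 1024;

// Split files into those small enough to check and those skipped for size.
// Skipped files are not verified, so corruption in them goes undetected
// by this check.
function partitionBySize(
  files: DbFile[],
  limit: number = MAX_CHECKABLE_BYTES,
): { checkable: DbFile[]; skipped: DbFile[] } {
  const checkable: DbFile[] = [];
  const skipped: DbFile[] = [];
  for (const file of files) {
    if (file.sizeBytes > limit) {
      skipped.push(file);
    } else {
      checkable.push(file);
    }
  }
  return { checkable, skipped };
}
```

Returning the skipped list explicitly would make it straightforward to log which files were never verified.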

Link to Devin session: https://app.devin.ai/sessions/4fe8e10b4d484f9787153e9ef5c88053



When the BoltDB integrity check fails at startup, wipe /var/lib/buildkit
so buildkitd starts fresh. Previously, the code only logged an error and
proceeded with the corrupted disk, creating a corruption persistence loop
where the same corrupted snapshot was cloned on every subsequent build.

The build will have a cache miss that one time, but the fresh state will
be committed at the end, breaking the loop automatically.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>